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We present parallelization of a quantum-chemical tree-code [J. Chem. Phys. 106, 5526 (1997)] 
for linear scaling computation of the Coulomb matrix. Equal time partition [J. Chem. Phys. 118, 
9128 (2003)] is used to load balance computation of the Coulomb matrix. Equal time partition is a 
measurement based algorithm for domain decomposition that exploits small variation of the density 
between self-consistent-field cycles to achieve load balance. Efficiency of the equal time partition is 
illustrated by several tests involving both finite and periodic systems. It is found that equal time 
partition is able to deliver 91 - 98 % efficiency with 128 processors in the most time consuming 
part of the Coulomb matrix calculation. The current parallel quantum chemical tree code is able 
to deliver 63 - 81% overall efhciency on 128 processors with fine grained parallelism (less than two 
heavy atoms per processor). 
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I. INTRODUCTION 

Self-consistent-field (SCF) theories such as Density 
Functional Theory and hybrid Hartree-Fock/Density 
Functional Theory are accurate and computationally ef- 
ficient. Traditional Gaussian-orbital quantum chemistry 
codes that use conventional methods[l| are usually re- 
stricted to small systems since these methods have steep 
scaling of 0{N'^~^) with respect to system size, N. Re- 
cently, significant progress has been made in the de- 
velopment of 0{N) methods that overcome these bot- 
tlenecks. These methods include cornputation of the 
Hartree-Fock exchange matrix SJl, 0, |S_HJ3, the 
Coulomb matrix i 1 [13, M^MMM- the 
exchange-correlation matrix [T^ [l^, [13; HM H^ i and it- 
erative alternatives to eigensolution of the SCF equa- 
tions iMillllllllMlailll. 

With the advent of parallel multi-processor computers, 
especially those based on commodity processors, there 
has been a great effort in the comm unity to parallelize 
quantum chemistr:v 

methods hold promise for large scale computations given 
the fact that with parallel linear scaling methods, an n- 
fold increase in processors should lead to an n-fold in- 
crease in simulation capability. However, this holds only 
for scalable algorithms. 

Two of the most computationally demanding parts in 
a density functional application are calculation of the 
exchange-correlation and Coulomb matrices. The 0{N) 
exchange-correlation matrix calculation has been effi- 
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ciently parallelized through the concept of equal time 
(ET) partition^. In this work, the ET partition is 
extended to load balancing calculation of the Coulomb 
matrix. 

Linear scaling computation of the Coulomb matrix 
has been achieved via the quantum-chemical tree-code 
(QCTC)[l3 m H and the continuous Fast Multi- 
pole Method (CFMM)|i,|l[l2. Both the tree-code|4l| 
and the Fast Multipole Methodji^ E3 were originally 
proposed to handle the astrophysical A''-body prob- 
lem. Parallelization of these A^-body algorithms has 
been an active area of research in the computer science 
community in EHiililHIiilliElii. Even 
though both the iV-body problem and the Coulomb ma- 
trix calculation share many similarities, especially in han- 
dling the far-field multipole contribution, little work has 
been done on the parallel Coulomb matrix calculation 
beyond the simple master-slave approach[32|3J,|33]. It 
should be pointed out that parallel 0{N) computation of 
the Coulomb matrix is highly irregular relative to paral- 
lel 0{N^) computation of the two-electron Coulomb inte- 
grals, where the jobs are significantly coarse grained, en- 
abling the master-slave approach to work well. It is well- 
known that the master-slave approach faces potential 
contention and load imbalance problems for fine grained 
parallelism[53j (a small ratio of work load to number of 
processors). These problems have indeed been observed 
in quantum chemical calculations ^30. j4Q] . so alternatives 
are needed. One may use the idea of counting the num- 
ber of interactions in parallel A^-body codes to load bal- 
ance computation as in the orthogonal recursive bisec- 
tion (ORB) 44] or Costzones methods0jE3- However, 
due to cost irregularities associated with different Gaus- 
sian extents, angular symmetries, and non-uniform access 
patterns, simple counting is not an optimal approach to 
load balance computation of the Coulomb matrix. 



In this work, our main emphasis is on load balanc- 
ing the most time consuming part of QCTC, which is 
traversal of the density tree for evaluation of Coulomb 
matrix elements. To load balance this highly irregular 
tree traversal, we use the equal time (ET) partition J43|, 
which was originally proposed to parallelize computation 
of the exchange-correlation matrix. 

Equal time partition works by measuring the time 
spent in computational sub-domains {e.g. a line, area, 
or volume) during one SCF cycle. At the end of the 
calculation, the time spent in each sub-domain is used 
to predict a new overall domain decomposition for the 
next SCF cycle, where each new sub-domain ideally in- 
curs the same amount of work in the next SCF cycle. 
The predicted domain decomposition will deliver an im- 
proved load balance in the next SCF cycle when there 
is a smooth variation of the workload between successive 
SCF cycles {e.g. due to small changes in the electron 
density). In this way, temporal locality 54] of the prob- 
lem is exploited to achieve a continuously improved load 
balance. 

In serial, the time to build the density tree constitutes 
about 2% or less of the total time spent in QCTC. Un- 
fortunately, Amdahl's law dictates that the performance 
of a massively parallel program is ultimately determined 
by its serial parts. Therefore we also need to consider 
parallel construction of the density tree. Again, ideas 
from parallel TV-body codes may be useful. The construc- 
tion of locally essential trees, which are just sufficient for 
tree traversal on each processor, |4^ avoids the problem 
of replicating the total density tree on each processor. 
Hashed oct-trees|4^ IS^ also solve the replication prob- 
lem, where hash tables are used to allow the program to 
access data in an efficient manner across multiple proces- 
sors. However, due to fact that these approaches entail 
significant code restructuring, for the present case, we 
have chosen a parallel replicated density tree approach. 

The remainder of this paper is organized as follows: In 
Section^lwe discuss our strategy to efficiently parallelize 
computation of the Coulomb matrix J. In Section IIIII 
we describe a computational implementation of parallel 
QCTC. In Section im we discuss results of speedup tests 
performed on a few representative finite and periodic sys- 
tems. In Section [3 we summarize the main conclusions 
of the paper. 



II. PARALLELIZATION OF QCTC 

The quantum-chemical tree-code (QCTC) for 0{N) 
calculation of the Coulomb matrix has been fully de- 
scribed in Refs. [lj| and |la]. Here we only highlight 
essential aspects of the algorithm so that discussions of 
parallelization of QCTC may be made. 

The Coulomb matrix element in a finite (gas phase) 
case is given by 



where the charge distribution [Sg (or simply the dis- 
tribution) /5ah(r), is a product of the (Gaussian) basis 
functions 0a (r) and 0b (r). The total density of the 
system, which includes both the electronic and nuclear 
parts, is denoted by ptot(r)- In QCTC, a hierarchical 
multipole representation of the electron density, called 
a density tree, is stored in an advanced fc-d tree data 
structure 57|, ,5^ |5S]. A compact representation of the 
density in terms of Hermite-Gaussian (HG)[ij, [l^ |63 
basis has been used. With the help of the density tree, 
QCTC re-expresses the matrix element in Eq. (0 as a 
sum of near-field (NF) and far-field (FF) terms |15| 
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where M^ is the irregular solid harmonic interaction ten- 
sor, 0^[f] — /drO^(r)/(r) is a moment of the reg- 
ular solid harmonics, Q runs over the all nodes in the 
density tree as determined by penetration admissibil- 
ity criterion (PAC) and multipole admissibility criterion 
fMAO flilllSl and q runs on the left over near-field prim- 
itive distributions in the density. For the periodic case, a 
periodic far-field term and a tin-foil boundary condition 
term[ll|H3 are added to the RHS of Eq. (01. 

Essential to QCTC is construction of the total den- 
sity tree. Once the density tree is built, calculation of 
the Coulomb matrix elements proceeds by transversing 
the tree and checking the PAC and MAC. When both 
PAC and MAC are met, the far-field contribution is cal- 
culated via the multipole approximation. The near-field 
contribution, however, is calculated analytically. 

From Eq. 121, it is easy to see that 0{N) computation 
of the Coulomb matrix is highly irregular; near- and far- 
field contributions are determined on the fly via a PAC 
and MAC that depends on both the distributions and the 
density. This poses a challenge to efficiently load balance 
parallel tree traversal in QCTC. 



A. Load balancing tree traversals 

Since ET partition involves measuring workload using 
a timer (e.g. the MPI.WTime function in the message- 
passing libary MPI|62]), it is important to decide what 
work load information to time so that the timing pro- 
cess itself does not incur too much overhead. This may 
be achieved by recording the time to traverse the tree 
for each distribution (i.e. Pafc(r) in Eq. Q). The time 
and position of each distribution are stored in an array 
on each processor to facilitate partitioning of the 3-D 
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bounding box (the root bounding box) that encloses all 
distributions. Equal time partition[40J is performed on 
the root bounding box to achieve equal time or cost in 
all sub-boxes (also called ET sub-boxes). ET partition 
creates ET sub-boxes by recursively partitioning a box 



into 2 sub-boxes such that each sub-box carries approx- 
imately the same amount of time. At the end of the 
procedure we obtain 2" ET sub-boxes, where n is an in- 
teger greater than zero. Assuming that the number of 
processors is 2", each processor will handle one of the 
2" ET sub-boxes in the next SCF cycle. The restric- 
tion of a power of two for the number of processors may 
be removed by using a general ET partitioning scheme 
detailed in the Appendix of Ref.[40|. We have used a 
robust bisection method|2j| to find the plane which ap- 
proximately divides the workload into half. We empha- 
size again the main difference between our ET scheme 
and other parallel TV-body codes [ij, |43 is that we use 
an exact load timing information rather than counting 
the number of interactions as in the orthogonal recursive 
bisection (0RB)|4J] and Costzones methods {47ij. 

For the periodic case, the time associated with each 
distribution includes the time to handle the periodic far 
field 15, 61] contribution. In this way, ET partition nat- 
urally uses combined timing information to load bal- 
ance computation, extending the power of ET partition 
to situations where different timing information may be 
grouped together. 

A working hypothesis of the ET partition applied to 
QCTC is that the sum of distribution times for each ET 
sub-box is constant irrespective of the sectioning. How- 
ever, for very fine grained parallelism, shifting a bisecting 
plane may induce a relatively large change in the total 
predicted distribution time in a sub-box. This is due 
to the fact that the total workload may not be equally 
divided among the sub-boxes because the distribution 
times are discrete (a distribution is either totally or not 
in a sub-box). Also, for very fine grained parallelism, the 
total work in a sub-box is more sensitive to a change in 
density that may also increase the load imbalance, an ef- 
fect which we have experienced in parallelization of the 
exchange-correlation matrix[4QJ . 

In the first cycle, there is no previous timing on which 
to base the ET. In such a case, we use a reasonable heuris- 
tic where each processor handles an approximately equal 
number of distributions (the total number of distribu- 
tions may not be exactly divisible by the number of pro- 
cessors). 



B. Parallel density tree build 

In principle, an efficient parallelization of the density 
tree build should make use of the fact that each processor 
is handling only part of the total distributions in an ET 
sub-box. Depending on the collection of distributions on 
each processor, a locally essential tree|43,l3 ™^y ^'^ '^o'^" 
structed which is just sufficient for the tree traversals of 
all the distributions on a processor, thus avoiding replica- 
tion of the entire density tree and enabling efficient use of 
the memory space. However, without an extensive pro- 
gramming task, it may be difficult to predetermine which 
part of the entire density tree is needed for construction 



of the locally essential tree. As a first attempt, we have 
chosen a simple approach to parallelize the entire density 
tree build. 

For simplicity of programming we assume the number 
of processors to be 2*^, where k is an integer greater than 
zero. Observing that there are 2*^ subtrees at the kth 
tier in the entire density tree, our current implementation 
adopts a simple scheme where each processor builds one 
of the /cth-ticr subtrees in the total density tree. When 
all processors have built a fcth-tier subtree, an all-to-all 
exchange is carried out where all processors get the rest 
of the /cth-tier subtrees so that a final "merging" up of 
the subtrees [lj;ll3 can be performed to obtain the entire 
density tree. 

Inefficiency of the current implementation of the paral- 
lel density tree build in the limit of a large number of pro- 
cessors is expected. The all-to-all exchange of data be- 
tween processors is expensive and does not scale with the 
number of processors. Also, after collecting the subtrees 
from all other processors, a processor has to "merge" 
more subtrees upward as the number of processors is in- 
creased. This will inevitably introduce more overhead 
as we use more processors. Even if one can overcome 
the all-to-all exchange problem, one still faces a problem 
where it may be wasteful in the use of memory to store 
the entire density tree on each processor. However, while 
the present parallel density tree build may be replaced by 
more sophisticated schemes, where locally essential trees 
are built [4J| or hashed trees are used|^^a|, the current 
implementation delivers very good speedups up to the 
128-processor level. 

As a side note, our first implementation of an ET par- 
allel QCTC tried to avoid the problem mentioned above 
by partitioning the entire density into disjoint local den- 
sities. Each processor then built a local tree based on 
the local density. However, since a distribution on one 
processor does not "see" other local density trees, every 
processor had to loop through all distributions and the 
resulting partial Coulomb matrices had to be resummcd 
using an all-to-all communication at the end of the cal- 
culation. This turned out to be practical only below the 
64-processor level. The speedup did not increase with 
more processors because the total intrinsic cost (i.e. the 
amount of useful work) of QCTC increases rather rapidly 
with the number of processors. The rapid increase of in- 
trinsic cost has at its root a break down of the hierarchi- 
cal multipole approximation, as physically close charges 
can no longer be grouped when they reside on different 
processors. Asymptotically, as the number of processors 
approach the number of charges, one reverts to the ex- 
pensive 0{N'^) algorithm. For periodic systems, where 
one has to visit the density tree many more times (loop- 
ing through periodic images) relative to the finite case, 
the speedup stagnates once we pass a certain number of 
processors. Since this version of the parallel QCTC does 
not scale with the number of processors, we do not con- 
sider it further in this work. 



III. IMPLEMENTATION 

We have implemented a parallel QCTC algorithm in 
MONDOSCF [63, a suite of programs for linear scahng 
electronic structure theory and ab initio molecular dy- 
namics. MONDOSCF has been written in Fortran 90/95 
with the message-passing library MPI 62]. Timings are 
performed using the MPI_WTIME function. 



IV. RESULTS 

We have performed scaling tests on both finite and pe- 
riodic systems. For the finite systems, we have chosen 
taxol (C47H51NO14) and 2 water clusters as test cases. 
For the periodic systems, we have chosen pentaerythritol 
tetranitrate (PETN)[63 and the J-phase of octahydro- 
l,3,5,7-tetranitro-l,3,5,7-tetrazocine (5-HMX)^| as rep- 
resentative test cases. These systems are chosen because 
they are highly inhomogeneous, three-dimensional (and 
two are periodic) systems posing a challenge to parallel 
QCTC. All runs were performed on a cluster of 256 4 
CPU HP/Compaq AlphaServer ES45s with the Quadrics 
QsNet High Speed Interconnect. 

For the purpose of performing the scaling tests, we 
start the calculation with the ST0-3G basis set and a low 
accuracy, and switch to the final basis set and accuracy 
using a mixed integral approach, and run for three SCF 
cycles. The density matrix P is saved to disk and scaling 
tests of parallel QCTC are performed. This procedure 
may not be necessary. However, we are confident that 
the timings are representatives of a routine calculation. 

The result of the taxol scaling test is shown in Fig. ^ 
The calculations are performed with the 6-31G and 6- 
31G** basis sets, and a GOOD accuracyli^. The results of 
two different speedups are presented. The first speedup, 
called the FT speedup, measures the efficiency of the 
ET partition for the Coulomb matrix element calculation 
by traversing the density tree (see Section IH A|l and is 
defined by 
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where i^ is the time to evaluate the matrix elements by 
traversing the density tree with n processors. Notice that 
the speedups are relative to a 2-processor calculation. 
The second speedup, called the QCTC speedup, mea- 
sures the overall efficiency of parallel QCTC. From Fig.Q] 
it is observed that the ET speedup is excellent up to 64 
processors. Efficiency at the 64-processor level (where 
the number of heavy atoms per processor is less than 1) 
is at least 94%. The overall parallel QCTC speedup is 
very good up to 32 processors but degrades slightly at the 
64-processor level. The overall parallel QCTC efficiencies 
at 64 processors are 77.6% and 83.0% for the 6-31G and 
6-31G** basis sets, respectively. 
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FIG. 1: Scaling of parallel QCTC on taxol (C47H51NO14) 
RBLYP/6-31G and RBLYP/6-31G**. Speedups are relative 
to a 2-processor calculation. The labels ET and QCTC denote 
equal time and overall parallel QCTC speedups, respectively. 



The loss of efficiency is due to the fact that the time for 
parallel density tree build docs not decrease at the same 
rate (i.e. divided by the number of processors) as the tree 
traversal part. In Fig. |21we present the tTB/trT ratio as 
a function of the number of processors for calculations 
on taxol (along with other systems for comparisons to 
be made later). We note that if the time for parallel 
tree build Itb were to decrease at the same rate (i.e. 
divided by the number of processors) as the time for tree 
traversal txT as we increase the number of processors, 
then the tTs/tTT ratio would remain nearly constant. 
However, Fig. |21 shows that the tTB/trT ratio increases 
steadily as the number of processors is increased in all 
cases, a fact that has been anticipated from the discussion 
in Section FlI Bl Since the slope for the 6-3 IG** case is 
smaller than that for the 6-31G case, this explains the 
slight increase in the overall parallel QCTC performance 
of the 6-31G** case over the 6-31G case, as shown in 
Fig. [2 

We note that our results of a parallel QCTC speedup 
of 7.80 (with the 6-31G and 6-31G** basis sets) with 8 
processors compares favorably with the speedup of about 
6.0 of Sosa et aL[33, which is for an entire single-point 
energy calculation with RHF/3-21G. 

Similar scaling tests have been performed on a 
110- molecule and 200- molecule water clusters with 
RBLYP/6-31G** at a GOOD accuracy level. The resuh 
of the scaling tests is shown in Fig. O It is found that 
the ET speedups are rather good for both cases. The 
overall parallel QCTC speedups are 80.3 and 85.5 for 
the 128-processor calculations for 110-molecule and 200- 
molecule water clusters, respectively. The decrease in 
parallel QCTC efficiency is again due to the high tTs/tTT 
ratio at the 128-processor level. These ratios are 28.8% 
and 26.6% for the 110-molecule and 200-molecule water 
clusters, respectively (see Fig. 12). 
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FIG. 2: The ratio of the time to build the density tree (irs) 
to the time to traverse the density tree (tTr), as a function 
of the number of processors. 
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FIG. 4: Scaling of parallel QCTC on 110-molecule water 
cluster with RBLYP/6-31G** on GOOD and TIGHT accuracies. 
Speedups are relative to a 2-processor calculation. 
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FIG. 3: Scaling of parallel QCTC on 2 cluster of water 
molecules with RBLYP/6-31G**. Speedups are relative to 
a 2-processor calculation. 



To investigate the performance of parallel QCTC at 
a higher accuracy level, we have performed the scaling 
tests on a 110-molecule vifater cluster but with TIGHT 
accuracyll^. The results for both GOOD and TIGHT accu- 
racies are presented in Fig. 0] for comparison. It is seen 
that the ET speedup is better for the TIGHT case than for 
the GOOD case, which is anticipated since increasing the 
accuracy level increases the effective granularity, which 
leads to a better performance in ET partitionjZ^. Over- 
all parallel QCTC increases its efficiency from GOOD to 
TIGHT, which is due mainly to a decrease in the ItbI'^tt 
ratio, as shown in Fig. |S1 

Finally, for the periodic systems Fig. |S1 shows that 
the overall parallel QCTC with RPBE/6-31G** on GOOD 
accuracy is excellent. At the 128-processor level, the 
1x1x1 (5-HMX (168 atoms per simulation cell) delivers 
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FIG. 5: The ratio of the time to build the density tree (Itb) 
to the the time to traverse the density tree {tTx), as a function 
of the number of processors, for llO-molecule water cluster 
(RBLYP/6-31G**) calculations on GOOD and TIGHT accura- 



104.0 fold speedup, while the 2 x 1 x 2 PETN (232 atoms 
per simulation cell) delivers 100.0 fold speedup. These 
performances are better compared to the llO-molecule 
(a speedup of 80.3) or 200-molecule (a speedup of 85.5) 
water cluster calculations. This is due to the smaller 
tTs/tTT ratio (see Fig. [21 for the periodic cases com- 
pared to the finite cases, which mainly results from the 
increase in the time spent in the tree traversal part (e.g. 
87.7 and 64.0 sees for the 2 x 1 x 2 PETN and 200-molecule 
water cluster, respectively). 
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FIG. 6: Scaling of parallel QCTC on S-HMX and PETN 
with RPBE/6-31G**. Speedups are relative to a 2-processor 
calculation. 



density tree for the matrix element calculation. Equal 
time exploits the temporal locality between SCF itera- 
tions to overcome strong spatial irregularities. It is ex- 
pected that ET should retain this property between ge- 
ometry steps in an optimization or molecular dynamics 
run. The efficiency of the ET partition ranges from 91 
- 98 % for all test cases presented in this work at the 
128-processor level. The overall QCTC speedup, how- 
ever, ranges from 63 - 81 % overall efficiency at the 
128-processor level with fine grained parallelism. The 
decrease in efficiency is mainly due to the parallel tree 
build process. While the current simple implementation 
of the parallel tree build should eventually be replaced by 
a more sophisticated version, the current implementation 
has enabled us to run routine calculations to address a 
wide range of interesting problems 65. 67] . 
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